feat(lora): save/restore LoRA config in checkpoint metadata by RexBearIU · Pull Request #4269 · AI-Hypercomputer/maxtext

RexBearIU · 2026-06-25T10:35:32Z

Description

This PR implements native serialization of LoRA configuration parameters (lora_rank, lora_alpha) in standard Orbax _CHECKPOINT_METADATA files, and automatically restores them during checkpoint-to-Hugging Face conversion.

Why is this change being made?

Previously, users had to manually supply matching lora.lora_rank and lora.lora_alpha parameters when converting MaxText checkpoints to Hugging Face format. Storing them in Orbax metadata removes that manual step and the risk of a rank/alpha mismatch during conversion (resolves @igorts-git's request in #3970).

Key Implementation Details

- Serialization: In save_checkpoint (checkpointing.py), the active config.lora block is saved under the "lora" key in Orbax's custom_metadata when a LoRA rank is specified.
- Restoration: In main (to_huggingface.py), sync_lora_metadata reads the custom metadata from lora_restore_path via ocp.StandardCheckpointer and overrides the active config parameters during conversion.
- Fail-Fast Safety: The override is scoped strictly to the conversion path, so SFT training paths remain strict and fail fast on any configuration mismatch.
- Test Import Refactoring: Refactored hf_checkpoint_conversion_test.py to move dynamically loaded inline imports to top-level imports, and removed the json import since the JSON string is written directly.
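
The save and restore steps listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this PR: LoraConfig, build_custom_metadata, and apply_lora_metadata are hypothetical names, "a LoRA rank is specified" is assumed to mean lora_rank > 0, and the Orbax calls that write and read the metadata are left out.

```python
from dataclasses import dataclass, replace
from typing import Any, Optional


@dataclass(frozen=True)
class LoraConfig:
  """Hypothetical stand-in for the config.lora block."""
  lora_rank: int = 0
  lora_alpha: float = 0.0


def build_custom_metadata(lora: LoraConfig) -> Optional[dict[str, Any]]:
  """Returns the dict to store as custom_metadata, or None when LoRA is off."""
  if lora.lora_rank <= 0:  # Assumption: rank <= 0 means "no LoRA rank specified".
    return None
  return {"lora": {"lora_rank": lora.lora_rank, "lora_alpha": lora.lora_alpha}}


def apply_lora_metadata(
    lora: LoraConfig, custom_metadata: Optional[dict[str, Any]]
) -> LoraConfig:
  """Overrides rank/alpha with saved values; returns the input if none were saved."""
  saved = (custom_metadata or {}).get("lora")
  if not saved:
    return lora
  return replace(
      lora,
      lora_rank=int(saved.get("lora_rank", lora.lora_rank)),
      lora_alpha=float(saved.get("lora_alpha", lora.lora_alpha)),
  )
```

In this sketch a checkpoint saved without LoRA metadata leaves the caller's values untouched, so older checkpoints would still need the parameters supplied by hand.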

BUGS: #3970

Tests

We verified the implementation by running the full test suite and the individual unit tests:

Added/Updated Unit Tests:
- SyncLoRAMetadataTest in tests/unit/hf_checkpoint_conversion_test.py to verify the auto-resolving mechanism during Hugging Face conversion.
Command to run:
python tests/unit/hf_checkpoint_conversion_test.py
All tests pass successfully.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have run end-to-end tests and provided workload links above if applicable.
I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

codecov · 2026-06-25T10:39:20Z

Codecov Report

❌ Patch coverage is 67.56757% with 12 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/maxtext/common/checkpointing.py	66.66%	3 Missing and 2 partials ⚠️
src/maxtext/utils/lora_utils.py	78.94%	2 Missing and 2 partials ⚠️
...rc/maxtext/checkpoint_conversion/to_huggingface.py	0.00%	3 Missing ⚠️

shralex

Thanks Jackie! A significant thing missing in this PR is using the metadata file on the checkpoint restore path.

RexBearIU · 2026-06-25T15:14:08Z

Hi @shralex, thank you for the feedback!

I have fully addressed your comments with the following changes:

- Checkpoint Restore Auto-Sync: Implemented automatic LoRA rank and alpha syncing from the Orbax native _CHECKPOINT_METADATA file's custom_metadata on the training/SFT restore path (restore_lora_from_path in lora_utils.py). Now, training/SFT runs resuming or restoring from a LoRA checkpoint will automatically detect, sync, and apply the correct LoRA rank and alpha parameters from the saved checkpoint metadata.
- Unified Native Orbax Metadata: Switched from creating and loading a custom lora_config.json to using Orbax's native custom_metadata dictionary inside _CHECKPOINT_METADATA. This conforms perfectly to standard checkpointing conventions without introducing any custom, out-of-band config files.
- Path Resilience: Enhanced metadata resolution to support paths pointing to either the step directory directly (e.g., .../checkpoints/1000/) or to any nested parameter subfolders (e.g., .../checkpoints/1000/items/), resolving parent paths gracefully.
- Expanded Unit Tests & Linting: Added and modified tests (SyncLoRAMetadataTest and SyncLoRAMetadataTrainingTest in both test suites) covering both conversion and training/SFT-side auto-restore flows. Verified everything compiles, passes all pre-commit formatting/styling, and is 100% green!
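
The path handling described above can be sketched roughly as below. This is an illustration under assumptions, not the PR's implementation: find_checkpoint_metadata and read_lora_metadata are hypothetical helpers, the search depth of two parent levels is arbitrary, only local filesystem paths are handled, and _CHECKPOINT_METADATA is assumed to be a JSON file with a top-level "custom_metadata" key.

```python
import json
from pathlib import Path
from typing import Any, Optional

METADATA_FILENAME = "_CHECKPOINT_METADATA"


def find_checkpoint_metadata(path: str, max_levels: int = 2) -> Optional[Path]:
  """Looks for the metadata file in `path`, then in up to `max_levels` parents."""
  current = Path(path)
  for _ in range(max_levels + 1):
    candidate = current / METADATA_FILENAME
    if candidate.is_file():
      return candidate
    if current.parent == current:  # Reached the filesystem root.
      break
    current = current.parent
  return None


def read_lora_metadata(path: str) -> dict[str, Any]:
  """Returns the saved "lora" block, or an empty dict when nothing is found."""
  metadata_file = find_checkpoint_metadata(path)
  if metadata_file is None:
    return {}
  with metadata_file.open() as f:
    metadata = json.load(f)
  # Assumption: custom metadata lives under a top-level "custom_metadata" key.
  return (metadata.get("custom_metadata") or {}).get("lora", {})
```

With this shape, a path to the step directory and a path to a subfolder such as items/ resolve to the same metadata file.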

Please let me know if you would like any other enhancements!

xibinliu · 2026-06-26T16:52:04Z

> Thanks Jackie! A significant thing missing in this PR is using the metadata file on the checkpoint restore path.

Added the logic to reuse the metadata on the checkpoint restore path.

shralex

This version reverts Xibin's previous version where sync_lora_metadata was in lora_utils. We should move it back there and use it not just on checkpoint conversion but also before model creation.

shralex · 2026-06-29T15:33:09Z

      replicator_error_handler(config)
-      return checkpoint_manager.save(step, args=Composite(state=checkpoint_args), force=force)
+      return checkpoint_manager.save(
+          step, args=Composite(state=checkpoint_args), force=force, custom_metadata=custom_metadata


EmergencyCheckpointManager and EmergencyReplicatorCheckpointManager do not accept a custom metadata argument. Let's leave this argument out here and open a bug to add this support.

Done! The custom_metadata argument is no longer passed when calling .save() on EmergencyCheckpointManager or EmergencyReplicatorCheckpointManager.

I've filed bug b/529671188 asking the Orbax team to add custom_metadata support to EmergencyCheckpointManager and EmergencyReplicatorCheckpointManager.
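
One way to express the resolution of this thread is to pass custom_metadata only to managers known to accept it. The sketch below is illustrative only: RegularManager and EmergencyManager are hypothetical stand-ins rather than Orbax classes, and save_with_optional_metadata is not a function from this PR.

```python
from typing import Any, Optional


class RegularManager:
  """Hypothetical stand-in for a manager whose save() accepts custom_metadata."""

  def save(
      self, step: int, *, force: bool = False, custom_metadata: Optional[dict] = None
  ) -> dict[str, Any]:
    return {"step": step, "force": force, "custom_metadata": custom_metadata}


class EmergencyManager:
  """Hypothetical stand-in for a manager whose save() lacks custom_metadata."""

  def save(self, step: int, *, force: bool = False) -> dict[str, Any]:
    return {"step": step, "force": force}


def save_with_optional_metadata(
    manager: Any, step: int, custom_metadata: Optional[dict], force: bool = False
) -> dict[str, Any]:
  """Passes custom_metadata only to managers that are known to accept it."""
  kwargs: dict[str, Any] = {"force": force}
  if custom_metadata is not None and not isinstance(manager, EmergencyManager):
    kwargs["custom_metadata"] = custom_metadata
  return manager.save(step, **kwargs)
```

In this sketch the metadata is dropped for the emergency manager, so checkpoints written through that path would carry no LoRA block until the underlying support exists.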

RexBearIU · 2026-06-30T10:05:08Z

> This version reverts Xibin's previous version where sync_lora_metadata was in lora_utils. We should move it back there and use it not just on checkpoint conversion but also before model creation.

Done. Moved sync_lora_metadata back to lora_utils.py (running during checkpoint restore) with a clean, formatting-free diff.

Co-authored-by: Xibin Liu <xibin@google.com>

RexBearIU mentioned this pull request Jun 25, 2026

docs: QLoRA Documentation and Notebooks #3970

Merged

4 tasks

shralex requested changes Jun 25, 2026

RexBearIU changed the title ~~feat(lora): serialize and load lora_config.json sidecar metadata~~ feat(lora): save and auto-restore LoRA rank/alpha using native Orbax custom_metadata Jun 25, 2026

RexBearIU force-pushed the jackyf/lora-ckpt-metadata branch from 187905b to cd17578 Compare June 25, 2026 15:13

shralex reviewed Jun 25, 2026

Comment thread src/maxtext/checkpoint_conversion/to_huggingface.py Outdated

RexBearIU force-pushed the jackyf/lora-ckpt-metadata branch from cd17578 to 1b15640 Compare June 25, 2026 16:02

RexBearIU force-pushed the jackyf/lora-ckpt-metadata branch from 1b15640 to ae44adc Compare June 25, 2026 16:11

igorts-git reviewed Jun 25, 2026

Comment thread tests/unit/hf_checkpoint_conversion_test.py Outdated

RexBearIU force-pushed the jackyf/lora-ckpt-metadata branch 3 times, most recently from 69c78a7 to a701719 Compare June 26, 2026 02:50

RexBearIU changed the title ~~feat(lora): save and auto-restore LoRA rank/alpha using native Orbax custom_metadata~~ feat(lora): save/restore LoRA config in checkpoint metadata Jun 26, 2026

igorts-git approved these changes Jun 26, 2026

xibinliu force-pushed the jackyf/lora-ckpt-metadata branch from a701719 to 07c5e19 Compare June 26, 2026 16:42

xibinliu force-pushed the jackyf/lora-ckpt-metadata branch 3 times, most recently from 5940e65 to 9bc253e Compare June 26, 2026 23:29

shralex reviewed Jun 27, 2026

Comment thread src/maxtext/utils/lora_utils.py Outdated

Comment thread src/maxtext/utils/lora_utils.py

RexBearIU force-pushed the jackyf/lora-ckpt-metadata branch from 9bc253e to 0f6248b Compare June 29, 2026 09:46

shralex requested changes Jun 29, 2026

RexBearIU force-pushed the jackyf/lora-ckpt-metadata branch 3 times, most recently from d55b90d to ffe10de Compare June 30, 2026 08:33

RexBearIU force-pushed the jackyf/lora-ckpt-metadata branch 2 times, most recently from 2649217 to e58e177 Compare June 30, 2026 10:12

RexBearIU mentioned this pull request Jun 30, 2026

feat(scan_layers): verify scan_layers compatibility from checkpoint metadata #4304

Draft

4 tasks

shralex approved these changes Jun 30, 2026

github-actions Bot added the pull ready label Jun 30, 2026

RexBearIU force-pushed the jackyf/lora-ckpt-metadata branch 2 times, most recently from c3b3d77 to e8f3545 Compare July 1, 2026 08:19

feat(lora): save/restore LoRA config in checkpoint metadata

ba410a3

Co-authored-by: Xibin Liu <xibin@google.com>

RexBearIU force-pushed the jackyf/lora-ckpt-metadata branch from e8f3545 to ba410a3 Compare July 1, 2026 08:24

RexBearIU wants to merge 1 commit into main from jackyf/lora-ckpt-metadata
